Combining Optimal Clustering And Hidden Markov Models For Extractive Summarization
Abstract
We propose hidden Markov models (HMMs) with unsupervised training for extractive summarization. Extractive summarization selects salient sentences from documents for inclusion in a summary. Unsupervised clustering combined with heuristics is a popular approach because it requires no annotated data. However, conventional clustering methods such as K-means do not take text cohesion into consideration. Probabilistic methods are more rigorous and robust, but they usually require supervised training with annotated data. Our method incorporates unsupervised training with clustering into a probabilistic framework. Clustering is done by modified K-means (MKM), a method that yields more nearly optimal clusters than the conventional K-means method. Text cohesion is modeled by the transition probabilities of an HMM, and term distribution is modeled by the emission probabilities. The final decoding process tags sentences in a text with theme class labels. Parameter training is carried out by the segmental K-means (SKM) algorithm. The output of our system can be used to extract salient sentences for summaries or for topic detection. Content-based evaluation shows that our method outperforms an existing extractive summarizer by 22.8% in terms of relative similarity, and outperforms a baseline summarizer that selects the top N sentences as salient sentences by 46.3%.
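The decoding step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the decoding algorithm, so this assumes standard Viterbi decoding, and every name and probability below (`viterbi`, `log_init`, `log_trans`, `log_emit`, the two-theme toy values) is hypothetical rather than taken from the paper. Transition scores stand in for text cohesion and emission scores for term distribution.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely theme label for each sentence.

    log_init[k]     : log P(first sentence has theme k)
    log_trans[j][k] : log P(theme k follows theme j), i.e. text cohesion
    log_emit[t][k]  : log P(terms of sentence t | theme k), i.e. term distribution
    """
    n = len(log_init)
    # best[t][k] is the score of the best path that ends in theme k at sentence t.
    best = [[log_init[k] + log_emit[0][k] for k in range(n)]]
    back = []
    for t in range(1, len(log_emit)):
        row, ptr = [], []
        for k in range(n):
            j_best = max(range(n), key=lambda j: best[-1][j] + log_trans[j][k])
            row.append(best[-1][j_best] + log_trans[j_best][k] + log_emit[t][k])
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)
    # Trace the back-pointers from the best final theme.
    state = max(range(n), key=lambda k: best[-1][k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def logs(rows):
    """Convert a nested list of probabilities to log probabilities."""
    return [[math.log(p) for p in row] for row in rows]

# Toy example: two themes, three sentences (all values are made up).
init = [math.log(0.6), math.log(0.4)]
emit = logs([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]])
sticky = logs([[0.9, 0.1], [0.1, 0.9]])   # strong cohesion between neighbours
uniform = logs([[0.5, 0.5], [0.5, 0.5]])  # no cohesion information

cohesive_tags = viterbi(init, sticky, emit)      # -> [0, 0, 0]
independent_tags = viterbi(init, uniform, emit)  # -> [0, 1, 0]
```

With uniform transitions each sentence simply takes its best-scoring theme, while the cohesive transitions pull the middle sentence into the theme of its neighbours. This is the effect the abstract attributes to modeling text cohesion, which plain K-means clustering lacks.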
Similar resources
Rhetorical Structure Modeling for Lecture Speech Summarization
We propose an extractive summarization system with a novel non-generative probabilistic framework for speech summarization. One of the most under-utilized features in extractive summarization is rhetorical information: semantically cohesive units that are hidden in spoken documents. We propose Rhetorical-State Hidden Markov Models (RSHMMs) to automatically decode this underlying structure in sp...
Extractive Chinese Spoken Document Summarization Using Probabilistic Ranking Models
The purpose of extractive summarization is to automatically select indicative sentences, passages, or paragraphs from an original document according to a certain target summarization ratio, and then sequence them to form a concise summary. In this paper, in contrast to conventional approaches, our objective is to deal with the extractive summarization problem under a probabilistic modeling fram...
Learning to Model Domain-Specific Utterance Sequences for Extractive Summarization of Contact Center Dialogues
This paper proposes a novel extractive summarization method for contact center dialogues. We use a particular type of hidden Markov model (HMM) called Class Speaker HMM (CSHMM), which processes operator/caller utterance sequences of multiple domains simultaneously to model domain-specific utterance sequences and common (domainwide) sequences at the same time. We applied the CSHMM to call summar...
Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization
We consider the problem of modeling the content structure of texts within a specific domain, in terms of the topics the texts address and the order in which these topics appear. We first present an effective knowledge-lean method for learning content models from unannotated documents, utilizing a novel adaptation of algorithms for Hidden Markov Models. We then apply our method to two complement...